Toward a Broad-coverage Bilingual Corpus for Speech Translation of Travel Conversations in the Real World

Authors

  • Toshiyuki Takezawa
  • Eiichiro Sumita
  • Fumiaki Sugaya
  • Hirofumi Yamamoto
  • Seiichi Yamamoto
Abstract

At ATR Spoken Language Translation Research Laboratories, we are building a broad-coverage bilingual corpus to study corpus-based speech translation technologies for the real world. There are three important points to consider in designing and constructing a corpus for future speech translation research. The first is to have a variety of speech samples, with a wide range of pronunciations and speakers. The second is to have data for a variety of situations. The third is to have a variety of expressions. This paper reports our trials and discusses the methodology. First, we introduce a bilingual travel conversation (TC) corpus of spoken languages and a broad-coverage bilingual basic expression (BE) corpus. TC and BE are designed to be complementary. TC is a collection of transcriptions of bilingual spoken dialogues, while BE is a collection of Japanese sentences and their English translations. Whereas TC covers a small domain, BE covers a wide variety of domains. We compare the characteristics of vocabulary and expressions between these two corpora and suggest that we need a much greater variety of expressions. One promising approach might be to collect paraphrases representing various different expressions generated by many people for similar concepts.

Similar resources

EFL Translation Students' Perspective toward Using Bilingual Dictionary in Translation of Polysemous Words

This research presented the use of bilingual dictionary and addressed the EFL translation students' points of view on the use of bilingual dictionary in translating polysemous words (English to Persian). Moreover, it aimed at finding the possible relationship between the effect of using bilingual dictionary by students in translating polysemous words and their achieved scores. In the study ...

Large scale speech-to-text translation with out-of-domain corpora using better context-based models and domain adaptation

In this paper, we described the process of building a large-scale speech-to-text pipeline. Two target domains, daily conversations and travel-related conversations between two agents, for the English-German language pair (both directions) are examined. The SMT component is built from out-of-domain but freely-available bilingual and monolingual data. We make use of most of the known available re...

An Automatic Speech Translation System for Travel Conversation

We present a speech-to-speech translation system for notebook PCs that helps oral communication between Japanese and English speakers in various situations when traveling abroad. Due to the high accuracy of the compact continuous speech recognition engine and our lexicalized grammar approach to machine translation that utilizes corpus but is oriented to model general linguistic phenomena as...

Multilingual Mobile-Phone Translation Services for World Travelers

This demonstration introduces two new multilingual translation services for mobile phones. The first translation service provides state-of-the-art text-to-text translations of Japanese as well as English conversational spoken language in the travel domain into 17 languages using statistical machine translation technologies trained automatically from a large-scale multilingual corpus. The second...

Creating corpora for speech-to-speech translation

This paper presents three approaches to creating corpora that we are working on for speech-to-speech translation in the travel conversation task. The first approach is to collect sentences that bilingual travel experts consider useful for people going to/coming from another country. The resulting English-Japanese aligned corpora are collectively called the basic travel expression corpus (BTEC), w...

Publication date: 2002